Towards Building Error Resilient GPGPU Applications
Abstract
GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general purpose computing. They are widely used in error-sensitive applications, i.e., General Purpose GPU (GPGPU) applications. However, the reliability implications of using GPUs are unclear. This paper presents a fault injection study to investigate the end-to-end reliability characteristics of GPGPU applications. The investigation showed that 8% to 40% of the faults result in Silent Data Corruption (SDC). To reduce the percentage of SDCs, we propose heuristics to selectively protect specific elements of the application and design fault detectors based on these heuristics. We evaluate the efficacy of the detectors in reducing SDCs and measure their performance overheads. Our results show that the heuristics are able to reduce SDC-causing faults by 60% on average, while incurring reasonable performance overheads (35% to 95%).
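To make the terminology concrete, the sketch below shows how a single-bit-flip fault injection campaign might classify outcomes as benign, crash, or SDC. It is a toy CPU-side model written for illustration only, not the paper's injector: `toy_kernel`, `classify`, `campaign`, the 64-element input, and the choice of the loop index as the fault target are all assumptions.

```python
import random


def flip_bit(word: int, bit: int) -> int:
    """Flip one bit of a 32-bit unsigned word (models a single-bit fault)."""
    return (word ^ (1 << bit)) & 0xFFFFFFFF


def toy_kernel(data, fault=None):
    """Hypothetical stand-in for a GPU kernel, run sequentially.

    Each "thread" i reads data[i] and stores data[i] // 16.
    fault = (thread, bit) flips one bit of that thread's index register.
    """
    out = []
    for i in range(len(data)):
        idx = i
        if fault is not None and fault[0] == i:
            idx = flip_bit(idx, fault[1])
        # An out-of-range idx raises IndexError, which models a crash.
        out.append(data[idx] // 16)
    return out


def classify(golden, run):
    """Compare a faulty run against the fault-free (golden) output."""
    try:
        result = run()
    except IndexError:
        return "crash"   # the fault was detected by the failure itself
    if result == golden:
        return "benign"  # the fault was masked and the output is unchanged
    return "sdc"         # silent data corruption: wrong output, no error


def campaign(trials=1000, seed=0):
    """Inject one random fault per trial and count the outcomes."""
    rng = random.Random(seed)
    data = list(range(64))
    golden = toy_kernel(data)
    counts = {"benign": 0, "sdc": 0, "crash": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(32))
        counts[classify(golden, lambda: toy_kernel(data, fault))] += 1
    return counts


if __name__ == "__main__":
    print(campaign())
```

In this toy model, flips in the low index bits are masked by the integer division, flips in the next two bits silently change the output, and flips in higher bits push the index out of range. A detector of the kind the abstract describes would aim to turn some of the `sdc` cases into detected ones, at the cost of extra checks.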
Similar resources
Evaluating the Error Resilience of GPGPU Applications
Over the past years, GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general purpose computing. A number of studies [1, 2] have shown that significant performance gains can be achieved by deploying GPUs on traditional high performance computing (HPC) systems that host demanding scientific applications. However, the reliability implications of using GPUs are unclea...
Error Resilience Evaluation on GPGPU Applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, ...
Towards Multi-tenant GPGPU: Event-driven Programming Model for System-wide Scheduling on Shared GPUs
Graphics processing units (GPUs) are attractive for general-purpose computing (GPGPU) beyond graphics. Sharing GPUs among such GPGPU applications is a key requirement, especially for cloud platforms whose resources are utilized by various cloud users. However, consolidating recent GPU applications, referred to as GPU eaters, on a GPU poses a new challenge. Such advanced application...
Soft Error Resilient QR Factorization for Hybrid System
As general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing because of their raw performance advantages over CPUs, fault tolerance has become more of a concern than it was when they were used exclusively for graphics applications. The pairing of GPUs with CPUs to form a hybrid computing system for better flexibility and perform...
Fault injection on GPGPU application
Today, with the development of GPU computing techniques in terms of architectures and hardware and software support, people have realized that intensive computing workloads can be ported to GPU devices. Applications can exploit GPUs' characteristics for parallel computing and gain a significantly higher speedup compared to CPU architectures. However, failures are still unavoidable. People have alrea...